
1,321 research outputs found

STATS - A Point Access Method for Multidimensional Clusters.

Author: AK Jain
CC Aggarwal
O Maimon
S Suthaharan
Publication venue: Springer Verlag
Publication date: 01/08/2017

The ubiquity of high-dimensional data in machine learning and data mining applications makes its efficient indexing and retrieval from main memory crucial. Frequently, these machine learning algorithms need to query specific characteristics of single multidimensional points. For example, given a clustered dataset, the cluster membership (CM) query retrieves the cluster to which an object belongs. To answer this type of query efficiently, we have developed STATS, a novel main-memory index that scales to answer CM queries on increasingly large datasets. Current indexing methods are oblivious to the structure of clusters in the data; we therefore develop STATS around the key insight that exploiting the cluster information when indexing, and preserving it in the index, will accelerate lookup. We show experimentally that STATS outperforms known methods with regard to retrieval time and scales well with dataset size for any number of dimensions.

Crossref

Spiral - Imperial College Digital Repository

Extending local features with contextual information in graph kernels

Author: CC Aggarwal
G San Martino Da
M Collins
N Shervashidze
S Vishwanathan
SVN Vishwanathan
T Gärtner
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Graph kernels are usually defined in terms of simpler kernels over local substructures of the original graphs. Different kernels consider different types of substructures. However, in some cases they have similar predictive performance, probably because the substructures can be interpreted as approximations of the subgraphs they induce. In this paper, we propose to associate with each feature a piece of information about the context in which the feature appears in the graph. A substructure appearing in two different graphs will match only if it appears with the same context in both graphs. We propose a kernel based on this idea that considers trees as substructures, and where the contexts are features too. The kernel is inspired by the framework in [6], even though it is not part of it. We give an efficient algorithm for computing the kernel and show promising results on real-world graph classification datasets. Comment: To appear in ICONIP 201

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Padova

Approximate Minimum Diameter

Author: A Jørgensen
C Fan
CC Aggarwal
M Löffler
PK Agarwal
R Fleischer
S Har-Peled
W Ju
Publication date: 31/03/2017

We study the minimum diameter problem for a set of inexact points. By inexact, we mean that the precise location of the points is not known. Instead, the location of each point is restricted to a continuous region (Imprecise model) or a finite set of points (Indecisive model). Given a set of inexact points in one of the Imprecise or Indecisive models, we wish to provide a lower bound on the diameter of the real points. In the first part of the paper, we focus on the Indecisive model. We present an $O(2^{\frac{1}{\epsilon^d}} \cdot \epsilon^{-2d} \cdot n^3)$-time approximation algorithm of factor $(1+\epsilon)$ for finding the minimum diameter of a set of points in $d$ dimensions. This substantially improves on the previously proposed algorithms for this problem. Next, we consider the problem in the Imprecise model. In $d$-dimensional space, we propose a polynomial-time $\sqrt{d}$-approximation algorithm. In addition, for $d=2$, we define the notion of $\alpha$-separability and use our algorithm for the Indecisive model to obtain a $(1+\epsilon)$-approximation algorithm for a set of $\alpha$-separable regions in time $O\left(2^{\frac{1}{\epsilon^2}} \cdot \frac{n^3}{\epsilon^{10} \cdot \sin(\alpha/2)^3}\right)$.
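To make the problem statement concrete for the case where each inexact point is restricted to a finite set of candidate locations, here is a minimal brute-force sketch. It is not the approximation algorithm from the abstract: it simply tries every combination of candidates, so it is exponential in the number of points and only useful as a reference on tiny inputs. The function names `diameter` and `min_diameter_bruteforce` are hypothetical.

```python
from itertools import combinations, product
from math import dist


def diameter(points):
    """Largest pairwise Euclidean distance among the given points."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def min_diameter_bruteforce(candidate_sets):
    """Minimum diameter over all ways of picking one candidate per inexact point.

    candidate_sets: one non-empty list of coordinate tuples per inexact point.
    Exhaustive search (assumption: inputs are small), not the paper's method.
    """
    return min(diameter(choice) for choice in product(*candidate_sets))


if __name__ == "__main__":
    # Three inexact points in the plane; the third has a single candidate.
    sets = [[(0, 0), (5, 0)], [(1, 0), (9, 0)], [(0, 1)]]
    print(min_diameter_bruteforce(sets))
```

The returned value is a lower bound on the diameter of the real points, whichever candidates turn out to be the true locations.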

arXiv.org e-Print Archive

Crossref

The Early Bird Catches The Term: Combining Twitter and News Data For Event Detection and Situational Awareness

Author: A Hermida
A Marcus
A Sadilek
CC Aggarwal
CC Chang
DA Broniatowski
E Aramaki
E Diaz-Aviles
F Chierichetti
H Abdelhaq
H Becker
H Kwak
J Yin
M Thelwall
M Walther
ML Hutwagner
P Shaver
R Long
Publication date: 09/04/2015

Twitter updates now represent an enormous stream of information originating from a wide variety of formal and informal sources, much of which is relevant to real-world events. In this paper we adapt existing bio-surveillance algorithms to detect localised spikes in Twitter activity corresponding to real events with a high level of confidence. We then develop a methodology to automatically summarise these events, both by providing the tweets which fully describe the event and by linking to highly relevant news articles. We apply our methods to outbreaks of illness and events strongly affecting sentiment. In both case studies we are able to detect events verifiable by third-party sources and produce high-quality summaries.

arXiv.org e-Print Archive

Crossref

PubMed Central

Spiral - Imperial College Digital Repository

Conformative Filtering for Implicit Feedback Data

Author: CC Aggarwal
D Goldberg
E Christakopoulou
J Pearl
K Wang
M Knott
NL Zhang
P Chen
T Chen
T Liu
TF Liu
W Pan
Y Koren
Publication venue: Springer Science and Business Media LLC
Publication date: 17/04/2019

Implicit feedback is the simplest form of user feedback that can be used for item recommendation. It is easy to collect and is domain independent. However, there is a lack of negative examples. Previous work tackles this problem by assuming that users are not interested, or are less interested, in the unconsumed items. Those assumptions are often severely violated, since non-consumption can be due to factors like unawareness or lack of resources. Therefore, non-consumption by a user does not always mean disinterest or irrelevance. In this paper, we propose a novel method called Conformative Filtering (CoF) to address the issue. The motivating observation is that if there is a large group of users who share the same taste and none of them has consumed an item before, then it is likely that the item is not of interest to the group. We perform multidimensional clustering on implicit feedback data using hierarchical latent tree analysis (HLTA) to identify user 'taste' groups and make recommendations for a user based on her memberships in the groups and on the past behavior of the groups. Experiments on two real-world datasets from different domains show that CoF has superior performance compared to several common baselines.

arXiv.org e-Print Archive

Crossref

Mining Uncertain Sequential Patterns in Iterative MapReduce

Author: B-S Jeong
CC Aggarwal
H Chernoff
J Jestes
M Muzammal
Y Tong
Z Zhao
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

This paper proposes a sequential pattern mining (SPM) algorithm for large-scale uncertain databases. Uncertain sequence databases are widely used to model inaccurate or imprecise timestamped data in many real applications, where traditional SPM algorithms are inapplicable because of data uncertainty and scalability. In this paper, we develop an efficient approach to manage data uncertainty in SPM and design an iterative MapReduce framework to execute the uncertain SPM algorithm in parallel. We conduct extensive experiments on both synthetic and real uncertain datasets. The experimental results show that our algorithm is efficient and scalable.

Crossref

IUPUIScholarWorks

Enhancement of Short Text Clustering by Iterative Classification

Author: C Zheng
CC Aggarwal
J Xu
S Kanj
S Shekhar
T Gollub
X Cheng
Y Du
Publication date: 30/01/2020

Short text clustering is a challenging task due to the lack of signal contained in such short texts. In this work, we propose iterative classification as a method to boost the clustering quality (e.g., accuracy) of short texts. Given a clustering of short texts obtained using an arbitrary clustering algorithm, iterative classification applies outlier removal to obtain outlier-free clusters. Then it trains a classification algorithm using the non-outliers based on their cluster distributions. Using the trained classification model, iterative classification reclassifies the outliers to obtain a new set of clusters. By repeating this several times, we obtain a much improved clustering of texts. Our experimental results show that the proposed clustering enhancement method not only improves the clustering quality of different clustering methods (e.g., k-means, k-means--, and hierarchical clustering) but also outperforms the state-of-the-art short text clustering methods on several short text datasets by a statistically significant margin. Comment: 30 pages, 2 figures
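The loop described in this abstract (remove outliers, train a classifier on the remaining points, reclassify the outliers, repeat) can be sketched as follows. This is only an illustration under loud assumptions: texts are already represented as numeric vectors, outliers are taken to be the points farthest from their own cluster centroid, and the classifier is a nearest-centroid rule. None of these choices, nor the parameter names `outlier_fraction` and `rounds`, come from the abstract.

```python
from statistics import mean


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [mean(column) for column in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def group_by_label(vectors, labels, skip=frozenset()):
    """Map each label to its vectors, ignoring the indices in `skip`."""
    groups = {}
    for i, (vec, lab) in enumerate(zip(vectors, labels)):
        if i not in skip:
            groups.setdefault(lab, []).append(vec)
    return groups


def iterative_classification(vectors, labels, outlier_fraction=0.2, rounds=5):
    """Refine an initial clustering by repeatedly reclassifying outliers.

    Stand-ins (assumptions, not the paper's choices): outliers are the points
    farthest from their own cluster centroid; the classifier is nearest-centroid.
    """
    labels = list(labels)
    n_out = int(len(vectors) * outlier_fraction)
    for _ in range(rounds):
        # Step 1: outlier removal, ranking points by distance to own centroid.
        cents = {l: centroid(p) for l, p in group_by_label(vectors, labels).items()}
        ranked = sorted(range(len(vectors)),
                        key=lambda i: sq_dist(vectors[i], cents[labels[i]]),
                        reverse=True)
        outliers = frozenset(ranked[:n_out])
        # Step 2: "train" the classifier on the non-outliers only.
        kept = group_by_label(vectors, labels, skip=outliers)
        model = {l: centroid(p) for l, p in kept.items()}
        # Step 3: reclassify the outliers with the trained model.
        new_labels = list(labels)
        for i in outliers:
            new_labels[i] = min(model, key=lambda l: sq_dist(vectors[i], model[l]))
        if new_labels == labels:  # no label changed, so stop early
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    vecs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    init = [0, 0, 0, 1, 1, 0]  # the last point starts in the wrong cluster
    print(iterative_classification(vecs, init))
```

Any classifier could replace the nearest-centroid rule here; it was picked only to keep the sketch dependency-free.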

arXiv.org e-Print Archive

Crossref

Towards Efficient Sequential Pattern Mining in Temporal Uncertain Databases

Author: C-K Chui
CC Aggarwal
H Zhang
J Jestes
M Muzammal
Y Tong
Z Zhao
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Uncertain sequence databases are widely used to model data with inaccurate or imprecise timestamps in many real-world applications. In this paper, we use uniform distributions to model uncertain timestamps and adopt possible world semantics to interpret temporal uncertain databases. We design an incremental approach to manage temporal uncertainty efficiently, which is integrated into the classic pattern-growth SPM algorithm to mine uncertain sequential patterns. Extensive experiments show that our algorithm performs well in both efficiency and scalability.

Crossref

IUPUIScholarWorks

A New Approach to Measuring Distances in Dense Graphs

Author: A Clauset
AK Jain
CC Aggarwal
KRUK Reddy
MEJ Newman
R Bonner
S Everitt
S Fortunato
U Zwick
VD Blondel
Publication venue: Springer, Cham
Publication date: 14/02/2019

The problem of computing distances and shortest paths between vertices in graphs is one of the fundamental issues in graph theory. It is of great importance in many different applications, for example, transportation and social network analysis. However, efficient shortest distance algorithms are still desired in many disciplines. In most dense graphs, there are ties between the shortest distances. Therefore, we consider a different approach and introduce a new measure to solve all-pairs shortest paths for undirected and unweighted graphs. This measures the shortest distance between any two vertices by considering the length and the number of all possible paths between them. The main aim of this new approach is to break the ties between equal shortest paths (SP), which can be obtained by the breadth-first search (BFS) algorithm, and to distinguish meaningfully between these equal distances. Moreover, using the new measure in clustering produces higher-quality results compared with SP. In our study, we apply two different clustering techniques, hierarchical clustering and K-means clustering, with four different graph models and for various numbers of clusters. We compare the results using a modularity function to check the quality of our clustering results.
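The abstract does not give the formula of the new measure, so the following is only a simplified illustration of the tie-breaking idea. It runs BFS on an undirected, unweighted graph and records, for each vertex, both the shortest distance and the number of distinct shortest paths from the source. Counting only shortest paths (rather than all possible paths) and the function name `bfs_dist_and_counts` are assumptions made for this sketch.

```python
from collections import deque


def bfs_dist_and_counts(adj, source):
    """BFS from `source` on an unweighted graph given as an adjacency dict.

    Returns two dicts: shortest distance to each reachable vertex, and the
    number of distinct shortest paths reaching it.
    """
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                # First visit: v lies one level below u.
                dist[v] = dist[u] + 1
                count[v] = count[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                # Another shortest route into v through u.
                count[v] += count[u]
    return dist, count


if __name__ == "__main__":
    # "d" and "f" are both two hops from "a", but "d" has two shortest paths.
    graph = {
        "a": ["b", "c", "e"],
        "b": ["a", "d"],
        "c": ["a", "d"],
        "d": ["b", "c"],
        "e": ["a", "f"],
        "f": ["e"],
    }
    print(bfs_dist_and_counts(graph, "a"))
```

In the example graph, plain shortest-path distance cannot separate "d" from "f", while the path count can, which is the kind of tie the abstract aims to break.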

Crossref

White Rose Research Online

On Coupling FCA and MDL in Pattern Mining

Author: A Gallo
B Ganter
CC Aggarwal
J Han
J Vreeken
M Mampaey
N Tatti
PD Grünwald
S Brin
SO Kuznetsov
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 25/06/2019

Pattern Mining is a well-studied field in Data Mining and Machine Learning. The modern methods are based on dynamically updating models, among which MDL-based ones ensure high-quality pattern sets. Formal concepts also characterize patterns in a condensed form. In this paper we study an MDL-based algorithm called Krimp in FCA settings and propose a modified version that benefits from FCA and relies on the probabilistic assumptions that underlie MDL. We provide experimental evidence that the proposed approach improves the quality of pattern sets generated by Krimp.

Crossref

INRIA a CCSD electronic archive server